Tabi: An Efficient Multi-Level Inference System for Large Language Models | Proceedings of the Eighteenth European Conference on Computer Systems

Small-model-assisted inference; read, notes not yet organized.


[2311.15566] SpotServe: Serving Generative Large Language Models on Preemptible Instances

LLM inference on preemptible instances; may have points of synergy with LoongServe.


Two works from Meta at OSDI

MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale | USENIX

Scheduler work; reportedly of very high quality.

Optimizing Resource Allocation in Hyperscale Datacenters: Scalability, Usability, and Experiences | USENIX

Resource allocation work.


Mooncake: Kimi’s KVCache-centric Architecture for LLM Serving

Work from Moonshot AI.

Mooncake (4): How the mooncake's crust and filling are made — open-sourcing the Mooncake Transfer Engine and follow-up plans - Zhihu

Optimizations in the transfer layer.


Xiaodong Wang

Market theory

Advanced Microeconomics series: General Equilibrium

Bayesian optimization


DLRover

Analysis of the Ant Group paper.


[2401.11181] Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

An optimization targeting different kinds of requests (disaggregation for mixed downstream workloads).


Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Alibaba's long-context work.


[2406.17565] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Huawei's prefill/decode (PD) disaggregation work.


[2405.07719] USP: A Unified Sequence Parallelism Approach for Long Context Generative AI

Jiarui Fang's work on long-context inference; includes a bandwidth-requirement analysis.


hao-ai-lab/vllm-ltr: [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank

Predicts when LLM requests will finish, and schedules accordingly.


penghuima/awesome-serverless-papers: Collect papers about serverless computing research

Collection of serverless computing papers.

Pyxis: Scheduling Mixed Tasks in Disaggregated Datacenters | IEEE Journals & Magazine | IEEE Xplore

Paper from Xin Jin's group.


Paper and code reading notes

A PhD student in Xin Jin's group; their personal homepage has many related paper-reading notes.


OSDI 2024 reading commentary series (Part 1)

OSDI 2024 reading commentary series (Part 2)

Session 4 DL

Session 6 Cloud Computing

OSDI 2024 reading commentary series (Part 3)

Session 11 ML Scheduling

[2406.19707] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management


Linear algebra

MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018 - YouTube


[2412.03213] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

Follow-up work to InfiniGen.


[Original long-form post] 2024.10 — The state of open-source LLM inference engines and common inference optimization techniques - Zhihu


Long-context sparsification works

DuoAttention

[2410.05076] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

[2407.02490] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

NeurIPS 2024; essentially a "Pro Max" version of Star Attention.


SOSP 2024 reading commentary series (Part 1)

SOSP 2024 reading commentary series (Part 2)

Session 3 Deep Learning and Training

SOSP 2024 reading commentary series (Part 3)

Session 6 Serverless

SOSP 2024 reading commentary series (Part 4)

Session 9 ML Serving

